1 Hybrid volume and polygon rendering with cube hardware
Kevin Kreeger, Arie Kaufman 

July 1999 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on
Graphics hardware

Publisher: ACM Press 

Full text available: pdf(1.85 MB) Additional Information: full citation, references, citings, index terms



Keywords: cube architecture, mixing polygons and volumes, ray casting, run-length- 
encoding, volume rendering 



2 GPGPU: general purpose computation on graphics hardware

David Luebke, Mark Harris, Jens Kruger, Tim Purcell, Naga Govindaraju, Ian Buck, Cliff
Woolley, Aaron Lefohn

August 2004 Proceedings of the conference on SIGGRAPH 2004 course notes GRAPH '04

Publisher: ACM Press 

Full text available: pdf(63.03 MB) Additional Information: full citation, abstract

The graphics processor (GPU) on today's commodity video cards has evolved into an 
extremely powerful and flexible processor. The latest graphics architectures provide 
tremendous memory bandwidth and computational horsepower, with fully programmable 
vertex and pixel processing units that support vector operations up to full IEEE floating 
point precision. High level languages have emerged for graphics hardware, making this 
computational power accessible. Architecturally, GPUs are highly parallel s ... 

3 Polygon rendering on a stream architecture

John D. Owens, William J. Dally, Ujval J. Kapasi, Scott Rixner, Peter Mattson, Ben Mowery 
August 2000 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on
Graphics hardware
Publisher: ACM Press 

Full text available: pdf(161.65 KB) Additional Information: full citation, abstract, references, citings, index terms

The use of a programmable stream architecture in polygon rendering provides a powerful 
mechanism to address the high performance needs of today's complex scenes as well as 
the need for flexibility and programmability in the polygon rendering pipeline. We describe 
how a polygon rendering pipeline maps into data streams and kernels that operate on
streams, and how this mapping is used to implement the polygon rendering pipeline on
Imagine, a programmable stream processor. We compare our resul ... 

Keywords: OpenGL, SIMD, graphics hardware, kernels, media processors, polygon 
rendering, stream architecture, stream processing, streams 



4 An overview of the BlueGene/L Supercomputer 

NR Adiga, G Almasi, GS Almasi, Y Aridor, R Barik, D Beece, R Bellofatto, G Bhanot, R 
Bickford, M Blumrich, AA Bright, J Brunheroto, C Caşcaval, J Castanos, W Chan, L Ceze, P
Coteus, S Chatterjee, D Chen, G Chiu, TM Cipolla, P Crumley, KM Desai, A Deutsch, T 
Domany, MB Dombrowa, W Donath, M Eleftheriou, C Erway, J Esch, B Fitch, J Gagliano, A 
Gara, R Garg, R Germain, ME Giampapa, B Gopalsamy, J Gunnels, M Gupta, F Gustavson, S 
Hall, RA Haring, D Heidel, P Heidelberger, LM Herger, D Hoenicke, RD Jackson, T Jamal- 
Eddine, GV Kopcsay, E Krevat, MP Kurhekar, AP Lanzetta, D Lieber, LK Liu, M Lu, M Mendell, 
A Misra, Y Moatti, L Mok, JE Moreira, BJ Nathanson, M Newton, M Ohmacht, A Oliner, V 
Pandit, RB Pudota, R Rand, R Regan, B Rubin, A Ruehli, S Rus, RK Sahoo, A Sanomiya, E 
Schenfeld, M Sharma, E Shmueli, S Singh, P Song, V Srinivasan, BD Steinmacher-Burow, K 
Strauss, C Surovic, R Swetz, T Takken, RB Tremaine, M Tsao, AR Umamaheshwaran, P 
Verma, P Vranas, TJC Ward, M Wazlowski, W Barrett, C Engel, B Drehmel, B Hilgart, D Hill, F 
Kasemkhani, D Krolak, CT Li, T Liebsch, J Marcella, A Muff, A Okomo, M Rouse, A Schram, M 
Tubbs, G Ulsh, C Wait, J Wittrup, M Bae, K Dockser, L Kissel, MK Seager, JS Vetter, K Yates 
November 2002 Proceedings of the 2002 ACM/IEEE conference on Supercomputing 
Publisher: IEEE Computer Society Press 

Full text available: pdf(357.61 KB) Additional Information: full citation, abstract, references, citings, index terms

This paper gives an overview of the BlueGene/L Supercomputer. This is a jointly funded 
research partnership between IBM and the Lawrence Livermore National Laboratory as 
part of the United States Department of Energy ASCI Advanced Architecture Research 
Program. Application performance and scaling studies have recently been initiated with 
partners at a number of academic and government institutions, including the San Diego 
Supercomputer Center and the California Institute of Technology. This mass ... 

5 Single-Chip MPEG-2 422P@HL CODEC LSI with Multi-Chip Configuration for Large
Scale Processing beyond HDTV Level

Hiroe Iwasaki, Jiro Naganuma, Koyo Nitta, Ken Nakamura, Takeshi Yoshitome, Mitsuo Ogura, 
Yasuyuki Nakajima, Yutaka Tashiro, Takayuki Onishi, Mitsuo Ikeda, Makoto Endo 
March 2003 Proceedings of the conference on Design, Automation and Test in
Europe: Designers' Forum - Volume 2 DATE '03
Publisher: IEEE Computer Society 

Full text available: pdf(362.07 KB) Additional Information: full citation, abstract, index terms
Publisher Site

This paper proposes a new architecture for VASA, a single-chip MPEG-2 422P@HL CODEC 
LSI with multi-chip configuration for large scale processing beyond the HDTV level, and 
demonstrates its flexibility and usefulness. This architecture consists of triple encoding 
cores, a decoding core, a multiplexer/de-multiplexer core, and several dedicated 
application-specific hardware modules with a hierarchical flexible communication scheme 
for high-performance data transfer. VASA is the world's first single ...

6 Session P9: interactive volume rendering: Texture hardware assisted rendering of 
time-varying volume data 

Eric B. Lum, Kwan Liu Ma, John Clyne 

October 2001 Proceedings of the conference on Visualization '01 
Publisher: IEEE Computer Society

Full text available: pdf(11.72 MB) Additional Information: full citation, abstract, references, citings, index terms
Publisher Site

In this paper we present a hardware-assisted rendering technique coupled with a 
compression scheme for the interactive visual exploration of time-varying scalar volume 
data. A palette-based decoding technique and an adaptive bit allocation scheme are 
developed to fully utilize the texturing capability of a commodity 3-D graphics card. Using 
a single PC equipped with a modest amount of memory, a texture capable graphics card, 
and an inexpensive disk array, we are able to render hundreds of time s ... 

Keywords: PC, compression, high performance computing, out-of-core processing, 
scientific visualization, texture hardware, time-varying data, transform encoding, volume 
rendering 



7 GI-cube: an architecture for volumetric global illumination and rendering
Frank Dachille, Arie Kaufman 

August 2000 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on
Graphics hardware

Publisher: ACM Press 

Full text available: pdf(650.91 KB) Additional Information: full citation, abstract, references, citings, index terms

The power and utility of volume rendering is increased by global illumination. We present 
a hardware architecture, GI-Cube, designed to accelerate volume rendering, empower
volumetric global illumination, and enable a host of ray-based volumetric processing. The 
algorithm reorders ray processing based on a partitioning of the volume. A cache enables 
efficient processing of coherent rays within a hardware pipeline. We study the flexibility 
and performance of this new architecture using both ... 

Keywords: hardware accelerator, volume processing, volume rendering, volumetric 
global illumination, volumetric ray tracing 



8 Embedded hardware design case studies: A fully-programmable memory
management system optimizing queue handling at multi gigabit rates
G. Kornaros, I. Papaefstathiou, A. Nikologiannis, N. Zervos

June 2003 Proceedings of the 40th conference on Design automation 

Publisher: ACM Press 

Full text available: pdf(236.13 KB) Additional Information: full citation, abstract, references, index terms

Two of the main bottlenecks when designing a network embedded system are very often 
the memory bandwidth and its capacity. This is mainly due to the extremely high speed of 
the state-of-the-art network links and to the fact that in order to support advanced quality 
of service (QoS), per-flow queueing is desirable. In this paper we describe the 
architecture of a memory manager that can provide up to 10 Gbps of aggregate throughput
while handling 512K queues. The presented system supports a complete ... 

Keywords: memory management, network processor 



9 Survey and taxonomy of packet classification techniques 
David E. Taylor 

September 2005 ACM Computing Surveys (CSUR), volume 37 issue 3 
Publisher: ACM Press 
Full text available: pdf(1.60 MB) Additional Information: full citation, abstract, references, index terms

Packet classification is an enabling function for a variety of Internet applications including 
quality of service, security, monitoring, and multimedia communications. In order to 
classify a packet as belonging to a particular flow or set of flows, network nodes must 
perform a search over a set of filters using multiple fields of the packet as the search key. 
In general, there have been two major threads of research addressing packet 
classification, algorithmic and architectural. A few pioneerin ... 

Keywords: Packet classification, flow identification 



10 Queue Management in Network Processors 

I. Papaefstathiou, T. Orphanoudakis, G. Kornaros, C. Kachris, I. Mavroidis, A. Nikologiannis 
March 2005 Proceedings of the conference on Design, Automation and Test in Europe
- Volume 3
Publisher: IEEE Computer Society 

Full text available: pdf(140.78 KB) Additional Information: full citation, abstract

One of the main bottlenecks when designing a network processing system is very often its 
memory subsystem. This is mainly due to the state-of-the-art network links operating at 
very high speeds and to the fact that in order to support advanced Quality of Service 
(QoS), a large number of independent queues is desirable. In this paper we analyze the 
performance bottlenecks of various data memory managers integrated in typical Network 
Processing Units (NPUs). We expose the performance limitations o ... 

Keywords: Network processor, memory management, queue management 



11 Research session: DB algorithms: Early hash join: a configurable algorithm for the
efficient and early production of join results

Ramon Lawrence 

August 2005 Proceedings of the 31st international conference on Very large data 
bases VLDB '05 

Publisher: VLDB Endowment 

Full text available: pdf(215.05 KB) Additional Information: full citation, abstract, references, index terms

Minimizing both the response time to produce the first few thousand results and the 
overall execution time is important for interactive querying. Current join algorithms either 
minimize the execution time at the expense of response time or minimize response time 
by producing results early without optimizing the total time. We present a hash-based join 
algorithm, called early hash join, which can be dynamically configured at any point during 
join processing to tradeoff faster production of result ... 

12 Scalable high-speed prefix matching 
Marcel Waldvogel, George Varghese, Jon Turner, Bernhard Plattner 
November 2001 ACM Transactions on Computer Systems (TOCS), volume 19 issue 4 
Publisher: ACM Press 

Full text available: pdf(933.02 KB) Additional Information: full citation, abstract, references, citings, index terms

Finding the longest matching prefix from a database of keywords is an old problem with a 
number of applications, ranging from dictionary searches to advanced memory 
management to computational geometry. But perhaps today's most frequent best 
matching prefix lookups occur in the Internet, when forwarding packets from router to 
router. Internet traffic volume and link speeds are rapidly increasing; at the same time, a 
growing user population is increasing the size of routing tables against which p ... 
Keywords: collision resolution, forwarding lookups, high-speed networking



13 Frequent value encoding for low power data buses 
Jun Yang, Rajiv Gupta, Chuanjun Zhang 

July 2004 ACM Transactions on Design Automation of Electronic Systems (TODAES),
Volume 9 Issue 3
Publisher: ACM Press 

Full text available: pdf(1.79 MB) Additional Information: full citation, abstract, references, index terms

Since the I/O pins of a CPU are a significant source of energy consumption, work has been 
done on developing encoding schemes for reducing switching activity on external buses. 
Modest reductions in switching can be achieved for data and address buses using a 
number of general purpose encoding schemes. However, by exploiting the characteristic 
of memory reference locality, switching activity on the address bus can be reduced by as 
much as 66%. Till now no characteristic has been identified ...

Keywords: I/O pin capacitance, Low power data buses, encoding, internal capacitance, 
switching 



14 Track 5: supercomputing (part I): ELDORADO
John Feo, David Harper, Simon Kahan, Petr Konecny 

May 2005 Proceedings of the 2nd conference on Computing frontiers 
Publisher: ACM Press 

Full text available: pdf(374.11 KB) Additional Information: full citation, abstract, references, index terms

This paper introduces Eldorado, a third generation multithreaded architecture. Previous 
Cray multithreaded systems were plagued by unreliable hardware and high costs. 
Eldorado corrects these problems by using many parts built for other commercial systems. 
Its compute processor is a 500 MHz multithreaded processor architecturally similar to the
MTA-2 processor; but its interconnection network, I/O subsystem, and service processors 
are borrowed from other Cray systems. Eldorado retains the program ... 

Keywords: heterogeneous architectures, multithreaded architectures, multithreaded 
processing, performance studies 





15 Robust header compression (ROHC) in next-generation network processors 
David E. Taylor, Andreas Herkersdorf, Andreas Doring, Gero Dittmann 
August 2005 IEEE/ACM Transactions on Networking (TON), volume 13 issue 4 

Publisher: IEEE Press 

Full text available: pdf(849.58 KB) Additional Information: full citation, abstract, references, index terms

Robust Header Compression (ROHC) provides for more efficient use of radio links for 
wireless communication in a packet switched network. Due to its potential advantages in 
the wireless access area and the proliferation of network processors in access 
infrastructure, there exists a need to understand the resource requirements and 
architectural implications of implementing ROHC in this environment. We present an 
analysis of the primary functional blocks of ROHC and extract the architectural implic ... 

Keywords: ASIC, ASIP, FPGA, ROHC, hardware assist, header compression, network 
processor, reconfigurable hardware 
computation/memory access trade-off

Thomas Gleerup, Hans Holten-Lund, Jan Madsen, Steen Pedersen 

May 2000 Proceedings of the eighth international workshop on Hardware/software 
codesign 

Publisher: ACM Press 

Full text available: pdf(421.48 KB) Additional Information: full citation, abstract, references, citings, index terms

This paper discusses the trade-off between calculations and memory accesses in a 3D 
graphics tile renderer for visualization of data from medical scanners. The performance 
requirement of this application is a frame rate of 25 frames per second when rendering 3D 
models with 2 million triangles, i.e. 50 million triangles per second, sustained (not peak). 
At present, a software implementation is capable of 3-4 frames per second for a 1 million 
triangle model. By usin ... 

Keywords: 3D graphics, case study, memory architecture 



17 Floorplanning: Floorplan assisted data rate enhancement through wire pipelining: a
real assessment
Mario R. Casu, Luca Macchiarulo 

April 2005 Proceedings of the 2005 international symposium on physical design 

Publisher: ACM Press 

Full text available: pdf(178.09 KB) Additional Information: full citation, abstract, references, index terms

The recent shift towards wire pipelining (WP) mandated by technological factors has 
attracted attention towards latency-controlled floorplanning. However, no systematic 
study has been published so far that takes into account block and logic delay limitations. 
The present work aims at filling the gap by showing that block delay can limit and possibly
prevent any real gain WP might promise. Recurring to adaptive WP schemes, on the other 
hand, allows relevant gains. We built floorplanner that optimiz ... 

Keywords: floorplanning, systems-on-chip, through-put, wire pipelining 




18 The TM3270 Media-Processor 

Jan-Willem van de Waerdt, Stamatis Vassiliadis, Sanjeev Das, Sebastian Mirolo, Chris Yen, 
Bill Zhong, Carlos Basto, Jean-Paul van Itegem, Dinesh Amirtharaj, Kulbhushan Kalra, Pedro 
Rodriguez, Hans van Antwerpen 

November 2005 Proceedings of the 38th annual IEEE/ACM International Symposium 
on Microarchitecture MICRO 38 

Publisher: IEEE Computer Society 
Full text available: pdf(421.73 KB) Additional Information: full citation, abstract
Publisher Site



We present the TM3270 media-processor, the latest TriMedia VLIW processor, tuned to 
address the performance demands of standard definition video processing, combined with 
embedded processor requirements for the consumer market. We discuss the architecture, 
implementation, and its first realization in a 90 nm process technology. The processor 
incorporates instruction set architectural (ISA) extensions and a load/store unit optimized 
for the video-processing domain. The ISA extensions improve the ... 

19 Full papers: Tree bitmap: hardware/software IP lookups with incremental updates 

Will Eatherton, George Varghese, Zubin Dittia 
April 2004 ACM SIGCOMM Computer Communication Review, Volume 34 issue 2

Publisher: ACM Press 
Full text available: pdf(189.39 KB) Additional Information: full citation, abstract, references

Even with the significant focus on IP address lookup in the published literature as well as 
focus on this market by commercial semiconductor vendors, there is still a challenge for 
router architects to find solutions that simultaneously meet 3 criteria: scaling in terms of 
lookup speeds as well as table sizes, the ability to perform high speed updates, and the 
ability to fit into the overall memory architecture of a Level 3 forwarding engine or
packet processor with low systems cost overhead. I ... 

20 Design and Implementation of High-Performance Memory Systems for Future Packet
Buffers

Jorge Garcia, Jesus Corbal, Llorenç Cerda, Mateo Valero

December 2003 Proceedings of the 36th annual IEEE/ACM International Symposium 
on Microarchitecture 

Publisher: IEEE Computer Society 

Full text available: pdf(348.55 KB) Additional Information: full citation, abstract, index terms

In this paper we address the design of a future high-speed router that supports line rates
as high as OC-3072 (160 Gb/s), around one hundred ports and several service classes.
Building such a high-speed router would raise many technological problems, one of them
being the packet buffer design, mainly because in router design it is important to provide
worst-case bandwidth guarantees and not just average-case optimizations. A previous
packet buffer design provides worst-case bandwidth guarantees by using ...